Summary. We demonstrate how information in the form of observable data
and moment constraints is introduced into the method of Maximum relative
Entropy (ME). A general example of updating with data and moments is
shown. A specific econometric example is solved in detail, which can then
be used as a template for real-world problems. A numerical example is
compared to a large deviation solution, which illustrates some of the
advantages of the ME method.
1 Introduction

The MaxEnt method [1] was designed to assign probabilities. This
method has evolved into a more general method, the method of Maximum
(relative) Entropy (ME) [2, 3], which has the advantage of not
only assigning probabilities but updating them when new information
is given in the form of constraints on the family of allowed posteriors.
The main purpose of this paper is to show both general and specific
examples of how the ME method can be applied using data and moment
constraints.

The two preeminent updating methods are the ME method and
Bayes' rule. The choice between the two methods has traditionally been
dictated by the nature of the information being processed (either
constraints or observed data), but questions about their compatibility are
regularly raised. Our first objective is to review how data is introduced
into the ME method.

Next we show a general example of updating with two different forms 
of information: moments and data. The solution resembles Bayes' Rule. 
The difference between this solution and the traditional Bayes form
results from using the moment constraint. This constraint modifies the
usual Bayesian likelihood. In an effort to put some names to these 
pieces we will call the standard Bayesian likelihood the likelihood and 
the part associated with the moment the likelihood modifier so that 
the product of the two yields the modified likelihood. We extend this 
general example by solving a specific ill-behaved econometric problem 
in detail, which can then be used as a template for real world problems. 
Numerical solutions are produced to explicitly illustrate the case. 

Recently, ill-behaved problems have been solved using large devi- 
ation theory or information-theoretic approaches. All of these meth- 
ods have a common premise: they rely on asymptotic arguments. The 
ME method does not need such assumptions to work and therefore 
can process finite amounts of data well. However, when ME is taken to
asymptotic limits one recovers the same solutions that the
information-theoretic methods produce. This is discussed by comparing the
numerical solution to our specific example with the solution that is
attained by the method of types.

2 Updating with data using the ME method 

Our first concern when using the ME method to update from a prior
to a posterior distribution is to define the space in which the search
for the posterior will be conducted. We wish to infer something about
the values of one or several quantities, $\theta \in \Theta$, on the basis
of three pieces of information: prior information about $\theta$ (the
prior), the known relationship between $x$ and $\theta$ (the model), and
the observed values of the data $x \in X$. Since we are concerned with
both $x$ and $\theta$, the relevant space is neither $X$ nor $\Theta$ but
the product $X \times \Theta$, and our attention must be focused on the
joint distribution $P(x,\theta)$. The selected joint posterior
$P_{\text{new}}(x,\theta)$ is that which maximizes the entropy,

$$S[P, P_{\text{old}}] = -\int dx\,d\theta\; P(x,\theta)\,\log\frac{P(x,\theta)}{P_{\text{old}}(x,\theta)} , \tag{1}$$

subject to the appropriate constraints. $P_{\text{old}}(x,\theta)$ contains
our prior information, which we call the joint prior. To be explicit,

$$P_{\text{old}}(x,\theta) = P_{\text{old}}(\theta)\,P_{\text{old}}(x|\theta) , \tag{2}$$

where $P_{\text{old}}(\theta)$ is the traditional Bayesian prior and
$P_{\text{old}}(x|\theta)$ is the likelihood. It is important to note that
they both contain prior information. The Bayesian prior is defined as
containing prior information. However, the likelihood is not traditionally
thought of in terms of prior information. Of course it is reasonable to
see it as such because the likelihood represents the model (the
relationship between $\theta$ and $x$) that has already been established.
Thus we consider both pieces, the Bayesian prior and the likelihood, to be
prior information.

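The new information is the observed data, $x'$, which in the ME
framework must be expressed in the form of a constraint on the allowed
posteriors. The family of posteriors that reflects the fact that $x$ is
now known to be $x'$ is such that

$$P(x) = \int d\theta\, P(x,\theta) = \delta(x - x') . \tag{3}$$

This amounts to an infinite number of constraints: there is one
constraint on $P(x,\theta)$ for each value of the variable $x$, and each
constraint will require its own Lagrange multiplier $\lambda(x)$.
Furthermore, we impose the usual normalization constraint,

$$\int dx\,d\theta\, P(x,\theta) = 1 . \tag{4}$$

Maximize $S$ subject to these constraints,

$$\delta\left\{ S + \alpha\left[\int dx\,d\theta\, P(x,\theta) - 1\right] + \int dx\,\lambda(x)\left[\int d\theta\, P(x,\theta) - \delta(x - x')\right] \right\} = 0 , \tag{5}$$

and the selected posterior is

$$P_{\text{new}}(x,\theta) = P_{\text{old}}(x,\theta)\,\frac{e^{\lambda(x)}}{Z} , \tag{6}$$

where the normalization $Z$ is

$$Z = e^{-\alpha+1} = \int dx\,d\theta\, e^{\lambda(x)}\, P_{\text{old}}(x,\theta) , \tag{7}$$

and the multipliers $\lambda(x)$ are determined from (3),

$$\int d\theta\, P_{\text{old}}(x,\theta)\,\frac{e^{\lambda(x)}}{Z} = P_{\text{old}}(x)\,\frac{e^{\lambda(x)}}{Z} = \delta(x - x') . \tag{8}$$

Therefore, substituting $e^{\lambda(x)}$ back into (6),

$$P_{\text{new}}(x,\theta) = \frac{P_{\text{old}}(x,\theta)\,\delta(x - x')}{P_{\text{old}}(x)} = \delta(x - x')\,P_{\text{old}}(\theta|x) . \tag{9}$$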

The new marginal distribution for $\theta$ is

$$P_{\text{new}}(\theta) = \int dx\, P_{\text{new}}(x,\theta) = P_{\text{old}}(\theta|x') . \tag{10}$$

This is the familiar Bayes' conditionalization rule. To summarize:
$P_{\text{old}}(x,\theta) = P_{\text{old}}(x)\,P_{\text{old}}(\theta|x)$ is
updated to
$P_{\text{new}}(x,\theta) = P_{\text{new}}(x)\,P_{\text{new}}(\theta|x)$,
with $P_{\text{new}}(x) = \delta(x - x')$ fixed by the observed data while
$P_{\text{new}}(\theta|x) = P_{\text{old}}(\theta|x)$ remains unchanged. We
see that, in accordance with the minimal updating philosophy that drives
the ME method, one only updates those aspects of one's beliefs for which
corrective new evidence (in this case, the data) has been supplied.
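The update in (9) and (10) can be checked numerically on a discrete grid. The following is a minimal sketch, not part of the original derivation: the coin-tossing model, the grid for $\theta$, and all variable names are assumptions chosen only for illustration. It applies the data constraint to a joint prior and compares the resulting marginal with the conventional Bayes posterior.

```python
import numpy as np
from math import comb

# Hypothetical setup (not from the paper): theta is a coin bias on a grid,
# x is the number of heads observed in n_trials tosses.
theta = np.linspace(0.01, 0.99, 99)
n_trials = 10
xs = np.arange(n_trials + 1)

prior_theta = np.full(theta.size, 1.0 / theta.size)      # P_old(theta)
likelihood = np.array(
    [[comb(n_trials, x) * t**x * (1 - t) ** (n_trials - x) for t in theta]
     for x in xs]
)                                                         # P_old(x|theta)
joint_prior = likelihood * prior_theta                    # eq. (2), shape (x, theta)

x_obs = 7                                                 # observed data x'
delta = (xs == x_obs).astype(float)                       # discrete analogue of delta(x - x')

# Eq. (9): P_new(x, theta) = P_old(x, theta) * delta(x - x') / P_old(x)
p_old_x = joint_prior.sum(axis=1, keepdims=True)
joint_post = joint_prior * delta[:, None] / p_old_x

# Eq. (10): marginalize over x to get P_new(theta)
me_posterior = joint_post.sum(axis=0)

# Conventional Bayes' rule, for comparison
bayes_posterior = prior_theta * likelihood[x_obs]
bayes_posterior /= bayes_posterior.sum()
```

Under these assumptions the two arrays agree to floating-point precision, which is the discrete counterpart of the statement that the data constraint alone reproduces Bayes' rule.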



3 Data and a moment 

In this general example, we extend our results from the previous section.
Again we wish to infer something about $\theta$, given some information.
The information that we are given in this example is some observed data,
$x'$, and a constraint on the posterior in the form of a moment. Here we
apply the data constraint simultaneously with the moment constraint.
Note that this problem cannot be solved by MaxEnt or Bayes' rule. For this
example, we assume the constraints,

$$\int dx\,d\theta\, P(x,\theta) = 1 , \tag{11}$$

which is our normalization constraint,

$$\int d\theta\, P(x,\theta) = \delta(x - x') = P(x) , \tag{12}$$

which represents some observable data,* and

$$\int dx\,d\theta\, P(x,\theta)\, f(\theta) = \langle f(\theta) \rangle = F , \tag{13}$$

which represents some additional information. Maximizing the entropy
given the constraints with respect to $P(x,\theta)$ yields,

$$P_{\text{new}}(x,\theta) = \frac{1}{Z}\, P_{\text{old}}(x,\theta)\, e^{\lambda(x) + \beta f(\theta)} , \tag{14}$$

where $Z$ is determined by using (11),

$$Z = e^{-\alpha+1} = \int dx\,d\theta\, e^{\lambda(x) + \beta f(\theta)}\, P_{\text{old}}(x,\theta) , \tag{15}$$

and the Lagrange multipliers $\lambda(x)$ are determined by using (12),

$$e^{\lambda(x)} = \frac{Z}{\int d\theta\, e^{\beta f(\theta)}\, P_{\text{old}}(x,\theta)}\, \delta(x - x') . \tag{16}$$

* Use of a $\delta$ function has been criticized in that by implementing it,
the probability is completely constrained, thus it cannot be updated by
future information. This is certainly true! However, imposing one
constraint does not imply a revision of the other: an experiment, once
performed and its outcome observed, cannot be un-performed and its result
cannot be un-observed by subsequent experiments.

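The posterior now becomes

$$P_{\text{new}}(x,\theta) = \frac{P_{\text{old}}(x,\theta)\,\delta(x - x')\, e^{\beta f(\theta)}}{\zeta(x,\beta)} , \tag{17}$$

where $\zeta(x,\beta) = \int d\theta\, e^{\beta f(\theta)}\, P_{\text{old}}(x,\theta)$.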

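The Lagrange multiplier $\beta$ is determined by first substituting the
posterior into (13),

$$\int dx\,d\theta \left[ \frac{P_{\text{old}}(x,\theta)\,\delta(x - x')\, e^{\beta f(\theta)}}{\zeta(x,\beta)} \right] f(\theta) = F , \tag{18}$$

which can be rewritten as

$$\int dx \left[ \frac{\int d\theta\, e^{\beta f(\theta)}\, P_{\text{old}}(x,\theta)\, f(\theta)}{\zeta(x,\beta)} \right] \delta(x - x') = F . \tag{19}$$

Integrating over $x$ yields,

$$\frac{\int d\theta\, e^{\beta f(\theta)}\, P_{\text{old}}(x',\theta)\, f(\theta)}{\zeta(x',\beta)} = F , \tag{20}$$

where $\zeta(x,\beta) \to \zeta(x',\beta) = \int d\theta\, e^{\beta f(\theta)}\, P_{\text{old}}(x',\theta)$.
Now $\beta$ can be determined by

$$\frac{\partial \ln \zeta(x',\beta)}{\partial \beta} = F . \tag{21}$$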

The final step is to marginalize the posterior $P_{\text{new}}(x,\theta)$
over $x$ to get our updated probability,

$$P_{\text{new}}(\theta) = \frac{P_{\text{old}}(x',\theta)\, e^{\beta f(\theta)}}{\zeta(x',\beta)} . \tag{22}$$

Additionally, this result can be rewritten using the product rule as

$$P_{\text{new}}(\theta) = \frac{1}{\zeta(x',\beta)}\, P_{\text{old}}(\theta)\, P_{\text{old}}(x'|\theta)\, e^{\beta f(\theta)} , \tag{23}$$

where $\zeta(x',\beta) = \int d\theta\, e^{\beta f(\theta)}\, P_{\text{old}}(\theta)\, P_{\text{old}}(x'|\theta)$.
The right side resembles Bayes' theorem, where the term
$P_{\text{old}}(x'|\theta)$ is the standard Bayesian likelihood and
$P_{\text{old}}(\theta)$ is the prior. The exponential term is a
modification to these two terms. In an effort to put some names to these
pieces we will call the standard Bayesian likelihood the likelihood and
the exponential part the likelihood modifier, so that the product of the
two gives the modified likelihood. The denominator is the normalization or
marginal modified likelihood.†
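The structure of (23) and the condition on $\beta$ in (20) and (21) can be illustrated with a small numerical sketch. Everything specific here is an assumption made for illustration and does not come from the text: a coin-bias parameter on a grid, a flat prior, hypothetical data of 7 heads in 10 tosses, the choice $f(\theta) = \theta$, and the target $F = 0.5$. The multiplier $\beta$ is found by bisection, which is a stand-in for whatever root finder one prefers.

```python
import numpy as np

# Hypothetical problem: theta is a coin bias on a grid (assumption).
theta = np.linspace(0.001, 0.999, 999)
heads, tosses = 7, 10            # hypothetical observed data x'
F = 0.5                          # hypothetical moment constraint <f(theta)> = F

prior = np.ones_like(theta)                                  # flat P_old(theta)
likelihood = theta**heads * (1 - theta) ** (tosses - heads)  # P_old(x'|theta)
f = theta                                                    # f(theta) = theta

def posterior(beta):
    """Eq. (23) on the grid: prior * likelihood * exp(beta * f), normalized."""
    log_w = np.log(prior * likelihood) + beta * f
    w = np.exp(log_w - log_w.max())      # subtract the max for numerical stability
    return w / w.sum()

def moment_gap(beta):
    """<f> under the posterior minus the target F; increasing in beta."""
    return posterior(beta) @ f - F

# Bisection on beta; the bracket [-200, 200] is an assumption that holds here.
lo, hi = -200.0, 200.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if moment_gap(mid) > 0:
        hi = mid
    else:
        lo = mid
beta = 0.5 * (lo + hi)
p_new = posterior(beta)
```

In this sketch the pure Bayes posterior has mean 8/12, above the imposed value of 0.5, so the likelihood modifier must pull the distribution downward and the resulting $\beta$ is negative.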

4 The econometric problem 

This is a general example of an ill-posed problem using the above
method: A factory makes $k$ different kinds of bouncy balls. For
reference, they assign each different kind a number,
$f_1, f_2, \ldots, f_k$. They ship large boxes of them out to stores.
Unfortunately, there is no mechanism that regulates how many of each ball
goes into the boxes, therefore we do not know the amount of each kind of
ball in each or all of the boxes. However, we are informed that the
company does know the average, $F$, of the ball labels in each of the
boxes over the time that they have been in existence. What is the
probability of getting a particular kind of ball in one of the boxes? At
this point one could use MaxEnt to answer the question, assuming that the
'average' could be substituted for the moment constraint. Now let us
complicate the problem by suggesting that we would like a better idea of
how many balls are in each box (perhaps for quality control, or perhaps
the customer would like more of one kind of ball than another). To do this
we randomly select a few balls, $n$, from a particular box and count how
many of each kind we get, $m_1, m_2, \ldots, m_k$ (or perhaps we simply
open the box and look at the balls on the surface). Now let us put the
above example in a more mathematical format.

Let the set of possible outcomes be represented by
$\{f_1, f_2, \ldots, f_k\}$ from a sample where the total number of balls
$N \to \infty$* and whose sample average is $F$. Further, let us draw a
data sample of size $n$ from the original sample, whose outcomes are
counted and represented as $m = (m_1, m_2, \ldots, m_k)$, where
$n = \sum_i^k m_i$. We would like to determine the probability of getting
any particular outcome in one draw ($\theta_i$) given the information.

† Including an additional constraint in the form of
$\int dx\,d\theta\, P(x,\theta)\, g(x) = \langle g \rangle = G$ could only
be used when it does not contradict the data constraint (12). Therefore,
it is redundant and the constraint would simply get absorbed when solving
for $\lambda(x)$.

* It is not necessary for $N \to \infty$ for the ME method to work. We
simply wish to use the description of the problem that is common in
information-theoretic examples.

To discuss the probabilities related to this situation, we implement
observational data simultaneously with an expectation value. We start with
the usual negative relative entropy for the joint space,

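$$S[P, P_{\text{old}}] = -\sum_m \int d\theta\, P(m,\theta|n)\, \log\frac{P(m,\theta|n)}{P_{\text{old}}(m,\theta|n)} . \tag{24}$$

We also have the following constraints,

$$\sum_m \int d\theta\, P(m,\theta|n) = 1 , \tag{25}$$

$$P(m|n) = \int d\theta\, P(m,\theta|n) = \delta_{mm'} , \tag{26}$$

$$\sum_m \int d\theta\, P(m,\theta|n)\, f(\theta) = \langle f(\theta) \rangle = F , \tag{27}$$

where $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$,
$m = (m_1, \ldots, m_k)$ and $m'$ is the observed data. Notice the use of
the Kronecker $\delta$ for the discrete case. Now we maximize the entropy
given the constraints with respect to $P(m,\theta|n)$, which yields, after
marginalizing over $m$ as in (23),

$$P_{\text{new}}(\theta) = \frac{1}{\zeta}\, P_{\text{old}}(\theta|n)\, P_{\text{old}}(m'|\theta,n)\, e^{\beta f(\theta)} , \tag{28}$$

where $\zeta = \int d\theta\, e^{\beta f(\theta)}\, P_{\text{old}}(\theta|n)\, P_{\text{old}}(m'|\theta,n)$.

We need to determine $P_{\text{old}}(m'|\theta,n)$ and
$P_{\text{old}}(\theta|n)$ for our problem. The equation that we will use
for the likelihood, $P_{\text{old}}(m'|\theta,n)$, is simply the
multinomial distribution,

$$P_{\text{old}}(m'_1 \ldots m'_k|\theta_1 \ldots \theta_k, n) = \frac{n!}{m'_1! \cdots m'_k!}\, \theta_1^{m'_1} \cdots \theta_k^{m'_k} . \tag{29}$$

Prior to receiving the information that the die is not fair due to the
bias, we were completely ignorant of the status of the die. Therefore, to
incorporate this ignorance we use a prior that is flat, thus
$P_{\text{old}}(\theta|n) = \text{constant}$. Being a constant, the prior
can come out of the integral and cancels with the same constant in the
numerator. (Also, the particular form of $P_{\text{old}}(\theta|n)$ is not
important for our current purpose, so for the sake of definiteness we can
choose it flat for our example.)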
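Before the moment constraint is added, the flat prior combined with the multinomial likelihood (29) gives a Dirichlet posterior with parameters $m'_i + 1$, a standard result. The short sketch below computes this pure-Bayes baseline; the labels and counts are illustrative values chosen here, and the variable names are assumptions.

```python
import numpy as np

labels = np.array([1.0, 2.0, 3.0])   # illustrative labels f_i (assumption)
counts = np.array([11, 2, 7])        # illustrative sample counts m'_i (assumption)

# Flat prior times multinomial likelihood is a Dirichlet(m'_i + 1) density,
# whose means are (m'_i + 1) / (n + k).
alpha = counts + 1
bayes_means = alpha / alpha.sum()

# Expected label under this baseline posterior, i.e. <sum_i f_i theta_i>.
bayes_label_mean = float(labels @ bayes_means)
```

The expected label of this baseline is the quantity that the moment constraint will later force to equal $F$, so the gap between the two indicates how strongly the likelihood modifier has to act.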

Now we include our average information. To do this, we rewrite
the moment constraint (27) to reflect the special case by replacing the
function $f(\theta)$ with $\sum_i^k f_i\theta_i$, where $f_i$ is a discrete
parameter that reflects the label for the outcomes and $F$ is the average.
The sum expresses the relationship among the outcomes, and $\theta_i$ is
the continuous parameter that we wish to infer something about. Thus the
constraint is rewritten in the following way,

$$\sum_m \int d\theta\, P(m_1,\theta_1 \ldots m_k,\theta_k|n)\, \delta\!\left(\sum_i^k \theta_i - 1\right) \sum_i^k f_i\theta_i = F , \tag{30}$$

where,

$$\sum_m = \sum_{m_1=0}^{n} \cdots \sum_{m_k=0}^{n} \delta\!\left(\sum_i^k m_i - n\right) \quad\text{and}\quad d\theta = d\theta_1 \ldots d\theta_k . \tag{31}$$

Notice that $F$ reflects the average over the outcomes.

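The resulting posterior is the product of the likelihood and what
we have called the likelihood modifier, $e^{\beta f(\theta)}$ or in this
case $e^{\beta \sum_i f_i\theta_i}$, divided by the normalization of the
two,

$$P_{\text{new}}(\theta_1 \ldots \theta_k) = \frac{1}{\zeta}\, \delta\!\left(\sum_i^k \theta_i - 1\right) \prod_{i=1}^{k} e^{\beta f_i\theta_i}\, \theta_i^{m'_i} , \tag{32}$$

where $\zeta = \int d\theta\, \delta\!\left(\sum_i^k \theta_i - 1\right) \prod_{i=1}^{k} e^{\beta f_i\theta_i}\, \theta_i^{m'_i}$.

To determine $\beta$ we use (21). This function can be complicated. One
may need to find $\beta$ numerically, for example with a search technique
such as Newton's method.

For simplicity we reduce the final $P_{\text{new}}(\theta)$ to $k-1$
dimensions,

$$P_{\text{new}}(\theta_1 \ldots \theta_{k-1}) = \frac{1}{\zeta'}\, e^{\beta f_k \left(1 - \sum_i^{k-1} \theta_i\right)} \left(1 - \sum_i^{k-1} \theta_i\right)^{m'_k} \prod_{i=1}^{k-1} e^{\beta f_i\theta_i}\, \theta_i^{m'_i} ,$$

where $\zeta' = \int d\theta\, e^{\beta f_k \left(1 - \sum_i^{k-1} \theta_i\right)} \left(1 - \sum_i^{k-1} \theta_i\right)^{m'_k} \prod_{i=1}^{k-1} e^{\beta f_i\theta_i}\, \theta_i^{m'_i}$.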



4.1 Solving the normalization factor 

The denominator, $\zeta'$, which is the normalization factor, can be a
difficult integral. The general solution for the $k$-sided die is a
hypergeometric series which is calculated on a $k-1$ simplex,

$$\zeta' = e^{\beta f_k}\, I_1(I_2(\ldots(I_{k-1}))) , \tag{33}$$

where

$$I_j = \Gamma(b_j - a_j) \sum_{q_j=0}^{\infty} \frac{\Gamma(a_j + q_j)}{\Gamma(b_j + q_j)}\, \frac{t_j^{q_j}}{q_j!}\, I_{j+1} \quad\text{with}\quad I_k = 1 , \tag{34}$$

and where $a_j = m'_{k-j} + 1$,
$b_j = n + j + 1 + \sum_{i=0}^{j-1} q_i - \sum_{i=0}^{k-j-1} m'_i$,
$t_j = \beta\,(f_{k-j} - f_k)$, $\Gamma(\ldots)$ is the gamma function, and
the terms $q_0$ and $m'_0$ are zero. The index $j$ takes all discrete
values from 1 to $k-1$. The total number of counts or rolls of the die is
$n$, with $m'_i$ being the number of counts for each parameter or
dimension, thus $n = \sum_{i=1}^{k} m'_i$. The summation terms for each
level of this nested series are represented by $q_j$. The factory
information is codified in $t_j$, where $\beta$ is the Lagrange multiplier
and the labels $f_{k-j}$ and $f_k$ come from (30).

A few technical details are worth mentioning: First, one can have
singular points when $t_j = 0$. In these cases the sum must be evaluated
in the limit as $t_j \to 0$. Second, since $a_j$ and $b_j$ are positive
integers the Beta functions involve no singularities. Lastly, the sums
converge because $b_j > a_j$.

Fig. 1. This figure shows the relationship between $\beta$ and $F$.
Notice that as the value for $F$ approaches the extremities of the
outcomes, $\beta$ approaches infinity.
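A single level of this kind of series can be checked against direct integration. The sketch below assumes that each level has the standard confluent hypergeometric (Kummer) form, $\Gamma(b-a)\sum_q \frac{\Gamma(a+q)}{\Gamma(b+q)}\frac{t^q}{q!} = \int_0^1 e^{tx} x^{a-1}(1-x)^{b-a-1}\,dx$ for $b > a > 0$, which is the one-dimensional integral that arises when $k = 2$. The function names and the test values of $a$, $b$ and $t$ are hypothetical.

```python
import math
import numpy as np

def series_level(a, b, t, terms=100):
    """Gamma(b-a) * sum_q Gamma(a+q)/Gamma(b+q) * t**q / q!  (one series level)."""
    total = 0.0
    for q in range(terms):
        # Work with log-gamma so the ratio of gamma functions cannot overflow.
        log_coeff = math.lgamma(a + q) - math.lgamma(b + q) - math.lgamma(q + 1)
        total += math.exp(log_coeff) * t**q
    return math.gamma(b - a) * total

def integral_level(a, b, t, n=200_000):
    """Midpoint-rule estimate of int_0^1 exp(t x) x^(a-1) (1-x)^(b-a-1) dx."""
    x = (np.arange(n) + 0.5) / n
    return float(np.mean(np.exp(t * x) * x ** (a - 1) * (1 - x) ** (b - a - 1)))

# Hypothetical values: a = m + 1 with m = 2, b - a = 5, and a moderate t.
a, b, t = 3, 8, -2.0
series_value = series_level(a, b, t)
integral_value = integral_level(a, b, t)
```

For $t = 0$ the series collapses to its first term, $\Gamma(a)\Gamma(b-a)/\Gamma(b)$, which is the Beta function mentioned in the text. For large $|t|$ the alternating terms can cancel badly in floating point, so direct quadrature may be the safer choice in practice.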

4.2 Numerical solutions 

We will extend the econometric example by applying the above
solutions to a specific problem where there are three kinds of balls
labeled 1, 2 and 3. So for this problem we have $f_1 = 1$, $f_2 = 2$ and
$f_3 = 3$. Further, we are given information regarding the average of all
the boxes, $F$. For our example this average will be $F = 2.3$. Notice
that this implies that on the average there are more 3's in each box. Next
we take a sample of one of the boxes where $m'_1 = 11$, $m'_2 = 2$ and
$m'_3 = 7$. The numerical solution for this example is,

$$P_{\text{new}}(\theta_1,\theta_2) = \frac{1}{\zeta'}\, e^{\beta(-2\theta_1 - \theta_2 + 3)}\, \theta_1^{11}\, \theta_2^{2}\, (1 - \theta_1 - \theta_2)^{7} , \tag{35}$$

where $\beta = 14.1166$ and $\zeta' = 1874.1247$. We show the relationship
between $\beta$ and $F$ in Fig. 1. The purpose of the Lagrange multiplier
is to enforce the moment constraint; therefore, as $F$ goes to the extreme
($F \to 3$), $\beta \to \infty$. This is important to mention because it
graphically illustrates that whether the deviation from the sample mean is
large or small, the ME method holds.
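The same example can be approached by brute-force quadrature instead of the nested series. The following sketch is an illustration under stated assumptions, not the procedure used to obtain the numbers above: it places a midpoint grid on the two-dimensional simplex (the grid size `N` is arbitrary), evaluates the unnormalized posterior of (35) in log space, and solves for $\beta$ with Newton's method, using the fact that the derivative of $\langle f \rangle$ with respect to $\beta$ is the variance of $f$.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])     # labels f_1, f_2, f_3
m = np.array([11, 2, 7])          # observed counts m'_1, m'_2, m'_3
F = 2.3                           # imposed average

# Midpoint grid on the simplex theta_1 + theta_2 < 1 (N is an assumption).
N = 400
i, j = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")
inside = (i + j + 1) < N          # integer test keeps theta_3 strictly positive
t1 = (i[inside] + 0.5) / N
t2 = (j[inside] + 0.5) / N
th = np.stack([t1, t2, 1.0 - t1 - t2])            # shape (3, n_points)

log_like = (m[:, None] * np.log(th)).sum(axis=0)  # log of theta_i^{m'_i} product
f_theta = f @ th                                  # sum_i f_i theta_i at each point

def weights(beta):
    """Normalized posterior weights on the grid for a given beta."""
    log_w = log_like + beta * f_theta
    w = np.exp(log_w - log_w.max())
    return w / w.sum()

# Newton's method on <f>(beta) = F, with d<f>/dbeta = Var(f).
beta = 0.0
for _ in range(50):
    w = weights(beta)
    mean_f = w @ f_theta
    var_f = w @ (f_theta - mean_f) ** 2
    step = (F - mean_f) / var_f
    beta += step
    if abs(step) < 1e-12:
        break

w = weights(beta)
theta_means = th @ w              # posterior means of theta_1, theta_2, theta_3
```

The resulting `beta` can be compared with the value quoted in the text; small differences would be expected from the finite grid. Working with log weights avoids overflow from the factor $e^{\beta(-2\theta_1 - \theta_2 + 3)}$ when $\beta$ is large.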

Another possible method suggested to use for this problem is the 
method of types [7]. This method essentially uses a form of Sanov's 
theorem, which for this problem would be written as, 

where $Q(\theta_i)$ is "estimated" with the frequency of the data sample.
Thus $Q(\theta_1) = m'_1/n = 11/20$, etc. This produces the following
results:

$$\theta_1 = 0.3015, \quad \theta_2 = 0.0971, \quad \theta_3 = 0.6015 . \tag{37}$$

Taking the means of the ME solution yields,

$$\langle\theta_1\rangle = 0.2942, \quad \langle\theta_2\rangle = 0.1115, \quad \langle\theta_3\rangle = 0.5942 . \tag{38}$$

Clearly the numerical solutions are very close; however, there are several
flaws with this large deviation method. The first is that $Q$ is treated
as a frequency. In the asymptotic case it would be appropriate to use
a frequency; unfortunately, this is not that case. The data sample is
finite, $n = 20$. Another flaw is that the method does not allow for
fluctuations, whereas the ME method does. Of course in the asymptotic
case fluctuations would be ruled out, but again, this is not the case.
There is an underlying theme here: probabilities are not equivalent to
frequencies except in the asymptotic case.
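The values in (37) can be reproduced with a few lines of code. This sketch assumes that the method-of-types solution is the distribution closest to $Q$ in relative entropy subject to the average constraint, which has the exponentially tilted form $\theta_i \propto Q(\theta_i)\, e^{\lambda f_i}$. The multiplier name `lam` and the bisection bracket are choices made here for illustration.

```python
import numpy as np

f = np.array([1.0, 2.0, 3.0])          # labels f_1, f_2, f_3
Q = np.array([11, 2, 7]) / 20.0        # sample frequencies used as Q
F = 2.3                                # imposed average

def tilted(lam):
    """theta_i proportional to Q_i * exp(lam * f_i), normalized."""
    w = Q * np.exp(lam * (f - f.max()))   # shift the exponent for stability
    return w / w.sum()

# The tilted mean increases with lam, so bisection finds the root.
lo, hi = -50.0, 50.0
for _ in range(200):
    mid = 0.5 * (lo + hi)
    if tilted(mid) @ f > F:
        hi = mid
    else:
        lo = mid

theta_star = tilted(0.5 * (lo + hi))
# theta_star rounds to (0.3015, 0.0971, 0.6015), matching (37)
```

This is a point estimate: it returns a single distribution and carries no spread, which is the sense in which the text says the method does not allow for fluctuations.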
5 Conclusions

Using the ME method we were able to use information in the form of
data and moments to update our prior probabilities. A general example
was shown where the solution resembled the traditional form of Bayes'
rule, with the standard likelihood being modified by a factor resulting
from the moment constraint.

A specific econometric example was then solved in detail to illustrate
the application of the method. This case can be used as a template for
real-world problems. Numerical results were obtained to illustrate
explicitly how the method compares to other methods that are currently
employed. The ME method was shown to be superior in that it did
not need to make asymptotic assumptions to function and allows for
fluctuations.

It must be emphasized that in the asymptotic limit, the ME form is
analogous to Sanov's theorem. However, this is only one special case.
The ME method is more robust in that it can also be used to solve
traditional Bayesian problems. In fact it was shown that if there is no
moment constraint, one recovers Bayes' rule.

Therefore, we would like to emphasize that anything one can do
with Bayes, one can now do with ME. Additionally, in ME one now
has the ability to apply additional information that Bayesian methods
could not. Further, any work done with Bayesian techniques can be
implemented into the ME method directly through the joint prior.
Finally, the ME method can now also be used to solve ill-posed problems
in econometrics.